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AVOIDING  COMMUNICATION  IN  THE  LANCZOS 
BIDIAGONALIZATION  ROUTINE  AND  ASSOCIATED  LEAST 
SQUARES  QR  SOLVER 

ERIN  CARSON 

Abstract.  Communication  -  the  movement  of  data  between  levels  of  memory  hierarchy  or 
between  processors  over  a  network  -  is  the  most  expensive  operation  in  terms  of  both  time  and 
energy  at  all  scales  of  computing.  Achieving  scalable  performance  in  terms  of  time  and  energy 
thus  requires  a  dramatic  shift  in  the  field  of  algorithmic  design.  Solvers  for  sparse  linear  algebra 
problems,  ubiquitous  throughout  scientific  codes,  are  often  the  bottlenecks  in  application  perfor¬ 
mance  due  to  a  low  computation/communication  ratio.  In  this  paper  we  develop  three  potential 
implementations  of  communication-avoiding  Lanczos  bidiagonalization  algorithms  and  discuss  their 
different  computational  requirements.  Based  on  these  new  algorithms,  we  also  show  how  to  obtain 
a  communication-avoiding  LSQR  least  squares  solver. 


1.  Introduction.  Classical  implementations  of  Krylov  methods,  Lanczos  bidi¬ 
agonalization  methods  included,  require  one  or  more  sparse  matrix-vector  multipli¬ 
cations  (SpMVs)  and  one  or  more  inner  product  operations  in  each  iteration.  These 
computational  kernels  are  both  communication-bound  on  modern  computer  architec¬ 
tures.  To  perform  an  SpMV,  each  processor  must  communicate  entries  of  the  source 
vector  it  owns  to  other  processors  in  the  parallel  algorithm,  and  in  the  sequential 
algorithm  the  matrix  A  must  be  read  from  slow  memory  (when  it  is  too  large  to  fit  in 
cache,  the  most  interesting  case).  Inner  products  involve  a  global  reduction  (see  [21, 
§11.4])  in  the  parallel  algorithm,  and  a  number  of  reads  and  writes  to  slow  memory 
in  the  sequential  algorithm  (depending  on  the  size  of  the  vectors  and  the  size  of  the 
fast  memory). 

Thus,  many  efforts  have  focused  on  communication-avoiding  Krylov  subspace 
methods  (CA-KSMs),  or  s-step  Krylov  methods,  which  can  perform  s  iterations  with 
a  factor  of  O(s)  less  communication  than  classical  KSMs;  see,  e.g.,  [3,  4,  6,  8,  9,  11, 
14,  15,  22,  23].  In  practice,  this  can  translate  into  significant  speedups  for  many 
problems  [18,  24].  In  this  paper,  we  will  use  the  terminology  ‘s-step  methods’,  which 
was  introduced  in  [5] .  The  reader  should  note  this  use  of  the  term  differs  from  other 
works,  e.g.,  [7,  16]  and  [13,  §9.2.7],  in  which  the  term  ‘s-step  methods’  is  used  to  refer 
to  a  type  of  restarted  Lanczos  procedure. 

In  this  manuscript  we  present  three  approaches  to  developing  communication- 
avoiding  variants  of  Lanczos  bidiagonalization.  Each  of  the  three  approaches  are  used 
to  give  both  communication-avoiding  upper  and  lower  bidiagonalization  routines.  The 
LSQR  least  squares  solver  of  Paige  and  Saunders  [19]  is  based  on  the  Lanczos  lower 
bidiagonalization  method,  and  we  use  this  to  derive  two  potential  CA-LSQR  methods 
based  on  two  of  our  approaches  to  communication-avoiding  lower  bidiagonalization. 

The  rest  of  this  manuscript  is  organized  as  follows.  Section  2  gives  a  brief  overview 
of  the  communication-avoiding  approach  and  motivates  such  methods  from  a  perfor¬ 
mance  perspective.  In  Section  3,  we  review  the  upper  and  lower  Lanczos  bidiago¬ 
nalization  procedures.  In  Section  4  we  demonstrate  three  approaches  to  developing 
communication-avoiding  versions  of  the  algorithms  in  Section  3.  Section  5  first  re¬ 
views  the  LSQR  method  and  then  gives  algorithms  for  two  possible  communication- 
avoiding  variants.  Section  6  briefly  discusses  future  work,  namely,  convergence  and 
performance  studies  to  compare  these  new  methods  and  determine  guidelines  for  when 
one  should  be  used  over  another. 
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2.  Background  on  communication-avoiding  Krylov  methods.  The  basic 
idea  behind  the  CA-KSMs  introduced  by  Hoennnen,  Mohiyuddin,  and  others  (see 
[15]),  is  to  unroll  the  iteration  loop  by  a  factor  of  s  >  1.  One  first  builds  bases  for  the 
Krylov  subspaces  known  to  contain  the  iteration  vectors  to  be  computed  in  the  next  s 
iterations,  computes  Gram  matrices  to  store  dot  products  between  these  basis  vectors, 
and  then  subsequently  performs  s  iterations,  updating  the  coordinates  of  the  iteration 
vectors  in  the  precomputed  bases  rather  than  the  iteration  vectors  themselves.  For  a 
thorough  treatment  of  communication-avoiding  Krylov  methods,  see  [1] . 

This  algorithmic  change  allows  use  of  communication-avoiding  kernels  which  can 
asymptotically  reduce  communication  cost.  The  matrix  powers  kernel  optimization 
fuses  together  a  sequence  of  s  SpMV  operations  into  one  kernel  invocation.  This  kernel 
is  used  to  compute  the  0(s)-dinrensional  Krylov  bases,  which  are  denoted  by  calli¬ 
graphic  letters  in  this  and  future  sections.  Depending  on  the  nonzero  structure  of  A  (or 
whatever  matrix  we  must  compute  a  basis  for),  this  enables  communication-avoidance 
in  both  serial  and  parallel  implementations,  as  described  in  paragraphs  below.  For  an 
in-depth  treatment  of  the  matrix  powers  kernel  implementation,  see  [10]. 

Serial.  In  serial,  the  matrix  powers  kernel  reorganizes  the  0{s )  SpMVs  to  max¬ 
imize  reuse  of  A  and  the  vector(s).  This  means  ideally  reading  A  and  the  starting 
vector  only  once  and  writing  the  s  output  vectors  spanning  the  Krylov  subspace  only 
once.  When  the  communication  cost  of  reading  A  dominates  that  of  reading/ writing 
the  vectors  (a  common  situation),  this  results  in  an  s-folcl  decrease  in  both  latency 
(the  number  of  messages  sent)  and  bandwidth  (the  number  of  words  moved). 

Parallel.  In  a  parallel  implementation,  the  matrix  powers  kernel  reorganizes  the 
computation  in  a  similar  way  but  with  a  slightly  different  goal.  In  a  parallel  SpMV 
operation,  only  entries  of  the  vectors  need  to  be  communicated.  The  parallel  matrix 
powers  kernel  avoids  interprocessor  synchronization  by  initially  storing  some  redun¬ 
dant  elements  of  A  and  the  starting  vector  on  different  processors  and  performing 
redundant  computation  to  compute  the  s  Krylov  basis  vectors  without  further  syn¬ 
chronization  in  between  SpMVs.  Provided  the  additional  bandwidth  and  latency 
cost  to  distribute  the  starting  vector  is  a  lower-order  term  (equivalently,  As  is  well 
partitioned,-,  see  [10])  this  gives  an  s-fold  savings  in  latency  cost. 

Serial  and  parallel  variants  of  the  matrix  powers  kernel,  for  both  structured  and 
general  sparse  matrices,  are  described  in  [17]  and  [1],  which  summarize  most  of  [10] 
and  elaborate  on  the  implementation  in  [18].  Within  [17],  we  refer  the  reader  to 
the  complexity  analysis  in  Tables  2.3-4,  the  performance  modeling  in  §2.6,  and  the 
performance  results  in  §2.10.3  and  §2.11.3,  which  demonstrate  that  this  optimization 
leads  to  speedups  in  practice. 

For  example,  for  a  2D  five-point  stencil  on  a  i fn  x  yfn  mesh  with  p  processors, 
assuming  s  <C  \fnjp ,  the  number  of  arithmetic  operations  grows  by  a  factor  1  + 
2 Syjpjn,  the  number  of  messages  decreases  by  a  factor  of  s/2,  and  the  number  of 
words  moved  grows  by  a  factor  of  1  +  (s/2 )y/p/n  [17].  Therefore  since  the  additional 
arithmetic  operations  and  additional  words  moved  are  lower  order  terms,  we  expect 
to  see  a  0(s)  speedup  when  latency  is  the  dominant  cost.  We  note  that  matrix  powers 
kernel  performance  is  sensitive  to  matrix  structure  and  hardware  parameters,  making 
it  a  good  candidate  for  inclusion  in  auto-tuning  libraries  and  specializers. 

Besides  SpMV  operations,  classical  KSMs  also  must  compute  inner  products  in 
each  iteration,  which  incur  a  costly  global  synchronization  on  parallel  computers.  For 
Lanczos-based  methods  like  the  ones  in  this  paper,  inner  products  are  computed  as 
a  block  operation  producing  Gram  matrices  which  are  later  used  to  compute  dot 
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products  without  additional  communication.  In  parallel,  this  can  lead  to  an  s-fold 
decrease  in  latency. 

These  communication-avoiding  variants  can  lead  to  speedups  in  practice.  We 
direct  the  reader  to  recent  performance  results  in  [24],  which  demonstrate  speedups 
up  to  4.2  for  a  communication-avoiding  BICGSTAB  implementation  with  s  =  4. 

3.  The  bidiagonalization  algorithm.  We  first  review  classical  algorithms  for 
reduction  of  a  matrix  to  both  upper  bidiagonal  and  lower  bidiagonal  form.  The 
original  procedure  given  by  Golub  and  Kahan  gives  the  procedure  as  a  reduction  to 
upper  bidiagonal  form  [12].  With  some  slight  modifications,  Paige  and  Saunders  [19] 
showed  that  a  similar  procedure  could  be  used  to  produce  a  reduction  to  lower  bidi¬ 
agonal  form,  and  that  this  formulation  was  more  amenable  to  solving  the  full-rank 
least  squares  problem  min  ||  Ax  —  6||2-  This  observation  forms  the  basis  for  the  LSQR 
algorithm.  Both  methods  are  connected  in  that  they  both  produce  the  same  sequence 
of  vectors  14  that  would  be  produced  by  the  symmetric  Lanczos  method  applied  to 
ATA. 

Let  A  be  an  m-by-n  matrix  and  b  be  a  length-m  vector.  After  k  iterations, 
the  Lanczos  upper  bidiagonalization  procedure  produces  the  rn-by-k  matrix  Pi-  = 
[PhV2,  ■  ■  ■ ,  Pk]  and  the  n-by-fc  matrix  14  =  [f  i ,  V2,  ■  ■  ■ ,  t'fe]  such  that 

Vfc(^iei)  =  ATb 
AVk  =  PkRk 

AT Pk  =  VkRk  +  Sk+iVk+i^k  i 

where 


Rk 


Pi  02 

P2  03 

Pk—1  0k 
Pk 


and  in  exact  arithmetic,  P^ Pk  =  I  and  PAT  V/c  =  I.  The  algorithm  of  Golub  and  Kahan 
for  reduction  to  upper  bidiagonal  form  is  shown  in  Algorithm  1.  Note  that  here  and  in 
the  remainder  of  this  paper  bars  over  variables  denote  intermediate  quantities  which 
are  yet  to  be  normalized.  We  note  that  one  can  formulate  the  upper  bidiagonalization 
algorithm  as  the  Lanczos  reduction  to  tridiagonal  form.  Letting 


•Z'  =  [21,22,  •  ■  • ,Z2k ]  = 


0  Pl  0  p2 
V\  0  V2  0 


0  Pk 
Vk  0 


0  A  ' 
AT  0  ’ 


and  T  = 


0  pi 
pi  0  02 

02  0  P2 

P2  0 

'  •  0  k 

0k  0  Pk 
Pk  0 
3 


the  procedure  in  Algorithm  1  is  mathematically  equivalent  to 

AZ  =  ZT  +  0k+lZ2k+1^2k^ 

with  ZH Z  =  /2fc  and  ZH Z2k+i  =  0.  This  means  that,  in  exact  arithmetic,  k  steps  of 
the  upper  bidiagonalization  procedure  applied  to  A  with  starting  vector  Vi  produces 
the  same  information  as  2k  steps  of  symmetric  Lanczos  applied  to  cyclic  matrix  A 
with  starting  vector  Z\  as  defined  above. 


Algorithm  1  Lanczos  reduction  to  upper  bidiagonal  form 
Require:  m-by-n  matrix  A  and  length-n  vector  b 
l:  9i  =  ||At6||2,  vi  =  ATb/9i,  pi  =  Av i,  pi  =  ||pi||2,  Pi  =  Pi/pi 
2:  for  i  =  1,2,...  until  convergence  do 
3:  Vi+i  =  ATpi  -  piVi 

4:  9i+ 1  =  ||2 

5:  Vi+ 1  =  Vi+i/9i+i 

6:  pi+i  =  Avi+ 1  -  9i+iPi 

T-  Pi+1  =  l|Pi+l  1 1 2 

8:  Pi  +  1  =  Pi+1  /  Pi  +  1 

9:  end  for 


We  will  also  consider  reduction  to  lower  bidiagonal  form  for  the  purpose  of  easy 
connection  to  the  LSQR  method  of  Paige  and  Saunders  [19].  Again,  A  is  an  m-by-n 
matrix  and  b  is  a  lengtli-m  vector.  After  k  iterations,  the  Lanczos  lower  bidiagonal¬ 
ization  procedure  produces  the  m-by-(fc  + 1)  matrix  Uk+i  =  [iq,  U2,  •  •  • ,  Ufc+i]  and  the 
n-by-k  matrix  14  =  [fi,  V2,  ■  ■  ■ ,  Vk\  such  that 


Uk+iiPiei)  —  b 

AVk  =  Uk+iBk 

ATUk+i  =  VkBTk  +  ak+iUk+iek+i, 


where 


Bk 


Oti 

P2  012 

P3 


Oik 

Pk+l 


and  in  exact  arithmetic,  U^+1Uk+i  =  I  and  17AT Vk  =  I. 

Again  in  this  case,  we  can  formulate  the  lower  bidiagonalization  algorithm  as  the 
Lanczos  reduction  to  tridiagonal  form.  Here  we  define 


Z  —  [Zi,  Z2,  ...  ,  Z2k] 


0  V\  0  i>2 
Ml  0  U2  0 


0  vk 
uk  0  ’ 
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A  = 


0  AT 
A  0 


and  T  = 


0 

Oi  i 

Oil 

0 

02 

^2 

0 

02 

02 

0 

Pk 


Pk 

0 

Oik 


oik 
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and  then  Algorithm  2  is  mathematically  equivalent  to 

AZ  =  ZT  +  pk+\Z2k+\0^ki 


with  ZH Z  =  I 2k  and  ZH Z2k+i  =  0.  Then  in  exact  arithmetic,  k  steps  of  the  lower 
bidiagonalization  procedure  applied  to  A  with  starting  vector  u\  produces  the  same 
information  as  2k  steps  of  symmetric  Lanczos  applied  to  cyclic  matrix  A  with  starting 
vector  z\  as  defined  above. 


Algorithm  2  Lanczos  reduction  to  lower  bidiagonal  form 
Require:  m-by-n  matrix  A  and  length-n  starting  vector  b 

i:  Pi  =  II&II2,  «i  =  b/p i,  vi  =  ATu1,  ax  =  ||vi||2,  «i  =  n/ax 

2:  for  «  =  1,2,...  until  convergence  do 
3:  Ui+ 1  =  AVi  -  UiUi 

4:  Pi+ 1  =  Hib+llh 

5:  Ui+  x  =  Ui+x/Pi+1 

6:  Vi+1  =  ATUi+1  -  Pi+xVi 

7:  cq:+ 1  =  ||m*+i||2 

8:  Vi+1  =  Mj+l/ojj+l 

9:  end  for 


4.  Communication-avoiding  Lanczos  bidiagonalization.  There  are  at  least 
three  ways  to  derive  communication-avoiding  variants  of  Algorithms  1  and  2,  each  with 
associated  pros  and  cons.  The  correct  method  to  choose  will  depend  on  the  structure 
and  conditioning  of  the  matrix,  the  requirements  of  the  particular  application,  and 
machine-specific  parameters  such  as  cache  size  and  relative  latency/bandwidth  cost. 
We  describe  the  three  potential  communication-avoiding  variants  in  subsections  below. 

4.1.  Equivalent  form  of  CA-Lanczos.  As  discussed  in  Section  3,  k  steps  of 
either  bidiagonalization  procedure  in  Algorithm  1  or  2  will  produce  the  same  infor¬ 
mation  as  2k  steps  of  symmetric  Lanczos  applied  to  the  appropriately  defined  cyclic 
matrix  A  and  appropriately  chosen  starting  vector  z\.  Therefore  one  can  simply 
use  an  existing  version  of  CA-Lanczos  (available  in,  e.g.,  [15,  1,  2])  run  on  input  A 
and  2 1,  and  recover  the  bidiagonalization  matrices,  either  P&,  14,  and  Rk  for  upper 
bidiagonalization,  or  U^+x,  14,  and  Bj,  for  lower  bidiagonalization,  from  Z  and  T. 

This  method  is  simple  and  allows  us  to  use  an  existing  communication-avoiding 
method.  The  drawback  is  that  the  system  is  now  twice  the  size,  and  extra  work  and 
storage  will  be  required  unless  the  Lanczos  method  is  modified  to  optimize  for  the 
block  non-zero  structure  of  the  matrix/vectors. 

4.2.  Forming  Krylov  bases.  By  introducing  auxiliary  quantities,  yet  another 
version  can  be  derived  that  works  by  building  s-step  Krylov  bases  with  AAT  and 
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AT A.  The  benefit  here  is  that  other  polynomial  bases  can  be  used  in  order  to  im¬ 
prove  numerical  properties  (e.g.,  Newton  or  Chebyshev).  The  drawbacks  are  that  this 
requires  computing  bases  with  powers  of  AAT  and  AT A,  which  squares  the  condition 
number  of  A.  Also,  in  order  to  satisfy  the  recurrences,  we  need  4s  +  1  basis  vectors 
in  each  iteration,  which  doubles  the  number  of  SpMVs  per  s  iterations  versus  the 
classical  method  (this  is  assuming  we  form  and  store  AT  A  and  AAT  offline;  otherwise 
the  number  of  SpMVs  required  is  8s  +  2).  We  note  that  this  is  equivalent  (in  exact 
arithmetic)  to  the  method  described  in  the  previous  subsection,  but  takes  nonzero 
blocks  into  account  and  uses  auxiliary  quantities. 

We  derive  this  method  below  for  both  upper  and  lower  bidiagonalization  proce¬ 
dures.  Note  that  in  communication-avoiding  algorithms,  we  will  switch  from  indexing 
iterations  by  i  to  indexing  iterations  by  sk  +  j,  where  s  is  the  iteration  blocking 
parameter,  k  is  the  outer  iteration  index,  and  j  in  the  inner  iteration  index. 

4.2.1.  Reduction  to  upper  bidiagonal  form.  Assume  we  are  beginning  it¬ 
eration  sk  +  1  of  Algorithm  1,  where  k  £  N  and  0  <  s  €  N,  so  that  vsk+ 1  and  psk+ i 
have  just  been  computed.  Recall  that 

Psk+j-\- 1  £  ^s+ i  ( AA  ,psk~ t-i)  +  Afs(AA  ,  Avsk~\-i)  and 
Vsk+j-\- 1  £  fcs(A  A,Vsk- l-l)  V  VS(A  A,  A  Psfc+l); 

for  j  £  {0, . . . ,  s}. 

We  define  basis  matrices  whose  columns  span  these  subspaces  as  follows.  Let 
Vk  be  a  basis  for  ICS(AT A,vsk+ 1),  Vfc  a  basis  for  /CS(AAT,  Avsk+i),  Vk  a  basis  for 
/Cs+i(AAT,psfc+i)  and  Vk  a  basis  for  K,S(AT A,  AT psk+ 1).  Assuming  these  polyno¬ 
mial  bases  are  generated  using  a  three-term  recurrence,  we  can  write  the  recurrence 
relations 


(AAT)[2fe,0,  Vfc,0] 
(ATA)[Vfe,0,2fc,0] 


[Vk,Vk] 


[Vk,Vk] 


pf\0] 

0 

Pf\  0] 

0 


0 

pf\o]_ 

0 

pf\o]_ 


and 


where  Vk ,  Vfc,  Vfc,  and  Vk  are  the  same  as  Vk ,  1-4,  Vk,  and  Pk,  resp.,  but  with  the 
last  column  removed,  and  the  Tk  matrices  are  tridiagonal  matrices  of  the  form 


ai  /3i 
7i 


Pi- 1 


ot-i 

7 i  . 


(4.1) 


where  above  i  =  s  for  and  i  =  s  —  1  for  T^V\  T^\  and  Note  that  the 

entries  otj ,  7 j ,  and  Pj  can  be  set  differently  depending  on  whether  we  are  constructing 
polynomials  in  AAT  or  AT A.  Also  note  that  these  recurrence  coefficients  could  be 
refined  with  each  new  outer  loop.  See  Philippe  and  Reichel  [20]  for  guidelines  on 
setting  these  entries  such  that  the  basis  condition  number  is  improved. 
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To  simplify  notation,  we  will  define  34  =  [Pk,Vk],  y_k  =  [2fc,0,Vfc,0],  Zk  = 
[Vfc,iPfc],  Zfe  =  [Vfc,0,Pfe,0],  and 

r(w  =  [[rf),o]  o  i  ^  =  r^iv),o]  0 

k  [  0  piV),0]J’  k  [  0  [tP,  0] 

This  lets  us  rewrite  the  recurrences  as 

{AAT)yk=ykT{p  and  (ATA)zk=zkrp. 

The  recurrences  do  not  give  us  a  way  to  represent  multiplication  by  A  and  AT  in 
these  new  bases,  which  are  necessary  to  perforin  updates  to  the  coordinate  vectors 
Vj+1  and  p'j_ |_i-  The  recurrences  do  however  give  ways  to  multiply  by  AAT  and  ATA, 
and  we  introduce  auxiliary  quantities  to  make  use  of  this.  Let 

Psk+j+i  —  A  Psiz+j+i  =  A  [Avsk+j+i  9sk+j+iPsk+j ^  /  psk+j+i 

=  ((A  A^vsk+j+i  9 sk-\-j+iPsk+j ) /psk+j+i ,  and 
^sk+j+1  =  Avsk+j  +  l  =  A(A  psk+j  psk+j^sk+j ^)  /  9  sk+j+1 

=  (( AA  ^psk+j  Psk-kj^sk+j )  /9sk+j+l  ■ 

Then  vector  updates  can  then  be  written 

V sk-\-j-\-l  —  Psk-\-j  Psk-\-j  Vsk+j  -> 

'Vsk-\-j-\- 1  —  ((-^-^  ^)Psk-\-j  Psk-\-j'Vsk-\-j)/®sk-\-j-\-l) 

Psk-\-j-\- 1  =  Vsk+j+l  @  sk-\- j -\-lP  sk-\- j 

for  j  G  { 1 , . . . ,  5},  and 

Psk+j+1  =  (( AtA)v  sfc+j  +  l  9 sk+j+lPsk+j)  /  Psk+j  +  1 

for  j  G  {1, . . . ,  s  -  1}  {psk+s- 1-1  is  not  needed).  As  before,  vsk+j+i  =  vsk+j+i/9sk+j+i 
and  Psk+j+i  =  Psk+j+l/ Psk+j+i-  Note  that  vsk+j+1  G  yk  for  j  G  {l,...,s}  and 
Psk+j+i  G  Zk  for  j  G  {l,...,s  —  1},  so  no  additional  basis  vectors  are  required  to 
represent  updates  to  these  auxiliary  quantities.  The  classical  version  of  this  modified 
upper  bidiagonalization  algorithm  is  given  in  Alg.  3. 

We  can  then  represent  vsk+j+ 1,  Usfc+j+i,  psk+j+ 1,  and  psk+j+i  by  their  coordi¬ 
nates  v'j+1,  vj+i,  Pj+ 1,  and  pj+1,  resp.,  in  yk  and  Zk,  i.e., 

Vsk+j+1  =  ZkVj+1 , 

Vsk+j+i  =  ykVj+i,  and 

Psk+j+i  =  ykp'J+ 1,  for  j  G  {1, . . . ,  s},  and 

Psk+j+i  =  zkPj+ 1  for  j  G  {1, . . . ,  s  -  1}.  (4.2) 

Note  that  using  (4.2),  in  each  new  outer  loop  we  initialize  the  coordinate  vectors  to 
Pi  =  ei,  v\  =  ei,  p[  =  es+i,  and  v[  =  es+2,  and  update  them  in  each  iteration  by  the 
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Algorithm  3  Lanczos  upper  bidiagonalization  with  auxiliary  quantities 
Require:  m-by-n  matrix  A  and  length-n  vector  b 
l:  6»!  =  ||At6||2,  Vi  =  ATb/0i,  p1  =  Avk,  px  =  ||pi||2,  Pi  =Pi/pi 
2:  V 1  =  Av  1,  pi  =  ATpi 
3:  for  i  =  1,2,...  until  convergence  do 
4:  Wi+l  =  pi  -  PiUi 

5:  $?:+l  =  ||Uj+l||2 

6:  Vi+i  =  Vi+i/di+i 

7:  Vi+t  =  ( AATpi  -  PiVi)/ei+i 

8:  pi+i  =  Vi+i  -  9i+iPi 

9:  Pi+ 1  =  ||P*+l||2 

10:  pi+i  =  pi+1/pi+i 

11:  pi  +  i  =  (ATAvi+ 1  -  6»i+ip,:)/pi+i 

12:  end  for 

formulas 

Vj+ 1  =  Pj  —  Psk+jVj , 
vj+ 1  =  Vj+l/Qsk+j+l, 

&j+ 1  =  (-^fc  Pj  ~  Psk+jVj) /9sk+j+ 1, 

=  Vj+1  -  iPp  and 

Pj+l  =  Pj+l/  Psk+j+l, 

for  j  £  {1, . . . ,  s},  and 

Pj+l  —  (^fc  '>Vj+ 1  —  ^sfc+j+lPj)/Psfe+j+l> 

for  j  £  {1,..  .  ,s  -  1}. 

Now,  it  remains  to  determine  how  to  compute  the  inner  products  9sk+j+ 1  and 
Psk+j+i  ■  We  can  write 

9  sk+ j  +1  =  (^sfe+j+l^sfe+j+l)  ^ 

=  {{Zkv'j+i)T  {Zkv'j+i)f/2 

=  {v?+1ZZZkv'J+i)1'2  (4.3) 

and 

Psk+j+l  =  {Psk+j+lPsk+j  +  l) 

=  {(ykp'3+1)T  (ykp'3+i))l/2 

=  (p'3T+1y^ykP'j+ r)1/2.  (4.4) 

Defining  the  Gram  matrices 

G(?)=yZyk  and  G[z)=ZlZk , 

we  can  compute  (4.3)  and  (4.4)  by  the  formulas 

9sk+j+ i  =  (u'+i  G[Z)v'j+1)1/2  and  psk+j+i  =  {pf+iG^  p'j+  i)1/2. 
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The  resulting  communication-avoiding  version  of  Algorithm  3  is  shown  in  Algo¬ 
rithm  4.  Note  that  in  lines  19  and  20  of  Algorithm  3,  we  have  shown  how  to  recover  all 
vectors  that  would  be  computed  in  the  s  iterations.  For  correctness  of  the  algorithm 
as  shown,  only  the  vectors  for  the  most  recent  iteration  need  be  recovered  for  use  in 
the  next  outer  loop. 


Algorithm  4  Communication-avoiding  Lanczos  upper  bicliagonalization  with  auxil¬ 
iary  quantities 

Require:  m-by-n  matrix  A  and  length-n  vector  b 
i:  9i  =  ||At6||2,  vi  =  ATb/9i,  pi  =  Av\,  pi  =  ||pi||2,  Pi  =  Pi/ Pi 
2:  Vi  =  Av i,  pi  =  ATpi 
3:  for  k  =  0, 1, . . .  until  convergence  do 

4:  Compute  Vfc,  a  basis  for  ICS(AT  A,vsk+i),  Vfc,  a  basis  for  JCS(AAT ,  Avsk+i), 

Vk,  a  basis  for  K.a+i{AAT  ,psk+i),  and  Vk ,  a  basis  for  ICS(AT A,  ATpsk+ 1).  Let 
yk  =  [Vk,Vk],Zk  =  [Vk,-Pk}. 

5:  g^]  =  ylyk  g[z)  =  z//zk 

6:  v'l  =  ei,  p'l  =  ei,  v'l  =  es+2,  p\  =  es+i. 

7:  for  j  =  1, . . . ,  s  do 

8:  v'j  +  i  =  P'j  -  Psk+jV'j 

9:  0sk+j  +  l  =  ( Vj+iGfr  ^Vj+iJ 

10;  Vj+ 1  =  Vj+1/9sk+j+ 1 

11:  Vj+:l  =  -  Psk+jVj)/9sk+j+ 1 

12:  Pj+1  =  ^j  + 1  —  ^«fc+j+lPj 

13:  pak+j+ 1  =  ^P'j+ 

14:  Pj+1  =  Pj+l/psk+j+1 

15:  if  j  <  s  then 

16:  Pj  +  1  =  (-^fc  vj  + 1  ~  9sk+j+lPj)/ Psk+j+1 

17:  end  if 

18:  end  for 

19:  [l^sfc-t-2;  •  *  ■  j  'I’sfc+s-t-l]  ^k[v2i  •  •  *  5 

20:  \psk+ 2,  ■  ■  ■  ,psfc+s+i]  =  yk\P2,  ■  ■  ■  ,Ps  + 1] 

21:  end  for 


4.2.2.  Reduction  to  lower  bidiagonal  form.  Now  assume  we  are  beginning 
iteration  sk  +  1  of  Algorithm  2,  where  k  £  N  and  0  <  s  €  N,  so  that  wsfc+i  and  i>sfc+i 
have  just  been  computed.  Recall  that 


V‘sk+j+ 1  £  Kjs(AA  ,  Avsk-\-i )  T KLgi^AA  ,usk-\-i)  and 
Vsk+j+i  £  K--s+i(A  -(-  KS(A  A,  A  usk-j-i)y 


for  j  £  {0,...,s}. 

We  then  define  basis  matrices  whose  columns  span  the  desired  subspaces  as  fol¬ 
lows.  Let  Uk  be  a  basis  for  ICS(AAT,  uak+i),  Vk  a  basis  for  1CS(AAT ,  Avsk+i),  Vk  a 
basis  for  tCs+i(AT  A,vsk+i)  and  Uk  a  basis  for  ICS(AT  A,  ATusk+ 1).  Assuming  these 
polynomial  bases  are  generated  using  a  three-term  recurrence,  we  can  write  the  re- 
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currence  relations 


(AAT)[Uk,0,Vk,0\ 

(ATA)[Vk,0,Uk:0] 


[ukyk] 


[VkMk] 


o] 

0 

pf\o] 

0 


0 

pf\o]_ 

0 

pf\  0]_ 


and 


where  Vfc,  Vfc,  and  are  the  same  as  Uk,  14,  14,  and  Uk ,  resp.,  but  with  the 
last  column  removed,  and  the  Tk  matrices  are  tridiagonal  matrices  of  the  form  given 
in  (4.1)  with  *  =  s  for  and  *  =  s  —  1  for  TkU\  Tk^\  and  Tjf1'1 .  As  before  the 
entries  dj,  7 j,  and  can  be  different  depending  on  whether  we  are  constructing 
polynomials  in  AAT  or  AT A  and  could  be  refined  with  each  new  outer  loop. 

To  simplify  notation,  we  will  define  34  =  [Z4,14],  y_k  =  [Wfe,0,Vfe,0],  Z k  = 
[Vk,Uz\,Zk  =  [Vkl  0,W*,0],  and 


rpiy)  — 
1k  — 


Pf\  0] 

0 


0 

pf\o]_ 


rp(%)  — 

1k  ~ 


pf\  0] 

0 


0 

pf\  0]_  ' 


This  lets  us  rewrite  the  recurrences  as 

{AAr)yk=ykT^)  and 
(ATA)Zk  =  ZkT^\ 


Note  that  these  are  the  same  recurrences  used  in  the  communication- avoiding  upper 
bidiagonalization  method  of  the  previous  subsection,  but  with  different  definitions  of 
34,  >24,  34,  and  Zk-  Again,  we  introduce  auxiliary  quantities.  Let 

Usk+j+l  =  A  Usk+j+1  =  A  (Avsk-\-j  QLsk-\-jUsk-\-j  'j  I  (3sk-\-j-\-l 

—  ((A  A^Vsk+j  &sk-\-j'U/sk-\-jS) / 1,  and 

Vsk+j  +  l  =  A'l?sfc_J_j_|_l  =  A(A  'Msfe+J  +  l  /^sfc+J  +  l^sfc+J  )/^sfe+J  +  l 

=  ((AA  /^s/c+j+l^sfc+j)  /^sfe+j+l  • 

Then  the  vector  updates  become 


/U'sk-\-j- 1-1  —  (^sfc+j  &sk-{-j ,'U'sk-\-j  ) , 

^sfe+j+i  ((A  A)us^_|_j  ) /As/c+j+i ,  and 

Vsk+j+l  —  (w,sk-\-j  ftsk+j+lVsk+j ), 


for  j  G  {1, . . . ,  s},  and 

^sfc+j+l  —  ((AA7  )^s/c+j4-l  ^s/c+j+l^s/c+j)  / ^sfc+j  +  1  > 

for  j  G  {1, . . . ,  s  —  1}.  As  before,  usk+j+ 1  =  usfc+j+i//?sfc+J+i  and  i>sfc+:,'+i  = 
vsk+j+\  / oisk+j+i-  The  classical  version  of  this  modified  lower  bidiagonalization  al¬ 
gorithm  is  given  in  Algorithm  5. 

Note  that  in  Algorithm  5,  usk+j+ 1  G  Zk  for  j  G  {0, . . . ,  s},  and  vsk+j+ 1  G  yk  for 
j  G  {0, . . . ,  s  -  1}.  Then  we  can  represent  usk+j+ 1,  usk+j+ 1,  vsk+j+ 1,  and  vsk+j+i  by 
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Algorithm  5  Lanczos  lower  bicliagonalization  with  auxiliary  quantities 
Require:  m-by-n  matrix  A  and  length-n  starting  vector  b 

i:  Pi  =  H&H2,  «i  =  b/p  1,  vx  =  ATm,  ax  =  H^i ||2,  vi  =  n/ax 

2:  Ux  =  AT'Ux,  Vx  =  Avx 
3:  for  i  =  1,2,...  until  convergence  do 
4:  ui+ 1  =  Dj  -  cqiq 

5:  /3*+l  =  ||Mi+l||2 

6:  Ui+x  =  Ui+l/Pi+X 

7:  u»+1  =  {ATAvi  -  aiUi)/Pi+1 

8:  Vi+ 1  =  «,  .  1  —  Pi+xVi 

9:  Cli+l  =  ll^i+l  H2 

10:  Vi+1  =  Vi+x/ai+x 

11:  vt+x  =  (AATui+ x  -  Pi+xVi) /ai+x 

12:  end  for 

their  coordinates  u'+1,  Uj+ 1,  e,.;  1,  and  Vj+i,  resp.,  in  34  and  Z k,  i.e. , 


=  34^  +  1, 

=  -ZfeMj+i, 

and 

Vsk-\-j-\- 1 

for  j  G  {1,.. 

.  ,s},  and 

=  34^+1 

for  j  G  {!,.. 

Note  that  using  (4.5),  in  each  new  outer  loop  we  initialize  the  coordinate  vectors  to 
Ui  =  ex,  v'x  =  ex,  u'x  =  es+2,  and  v[  =  es+i,  and  update  them  in  each  iteration  by  the 
formulas 


U'j- (-1  —  ^  j  Qsk-\-j'Ujj’> 

uj+ 1  =  ^j+l/  Psk+j+l, 

Mj+1  =  (74  Vj  —  ask+jUj)  / Psk+j+l, 

v'j+1  =  Uj+x  -  Psk+j+iVj,  and 

Mj+l  =  Vj+l/ask+j+l, 

for  j  G  {1, . . 

. ,  s},  and 

Vj+ 1  =  —  /^sfc+j+iu^/asfc+j+i 

for  j  G  {1, . . 

,,s-  1}. 

Now,  it  only  remains  to  determine  how  to  compute  the  inner  products  Psk+:j+\ 
and  ask+j+ 1-  We  can  write 

Psk+j+l  =  (^sk+j+l^sk+j+l)  ! 

=  ((^m'+i)T(^m'+1))1/2 

=  K^^m'+1)1/2  (4.6) 

and 

ask+j+ 1  =  (f'sfe+j+i'^sfc+j+i)  ^ 

=  ((Zkv'j+x)T(Zkv'j+1))1/2 
=  {v?+lZZZkv'j+x)112. 
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(4.7) 


Defining  the  Gram  matrices 

Gf }  =  yl  34  and  G[Z)  =  Zl  Zk , 

equations  (4.6)  and  (4.7)  become 

Psk+j+1  =  (w'^1Gp;)u'+1)1/2  and  ask+j+1  =  (n'+1G^)u'+1)1/2. 

The  resulting  communication-avoiding  version  of  Algorithm  5  is  shown  in  Algorithm  6. 
Again  note  that  in  lines  19  and  20  of  Algorithm  5,  we  show  recovery  of  all  vectors 
that  would  be  computed  in  the  s  iterations,  although  only  the  vectors  from  the  most 
recent  iteration  need  be  recovered  for  correctness. 


Algorithm  6  Communication-avoiding  Lanczos  lower  bidiagonalization  with  auxil¬ 
iary  quantities 

Require:  m-by-n  matrix  A  and  length-n  starting  vector  b 
i:  /? i  =  ||&||2,  ux  =  b//3i,  vx  =  ATux,  ax  =  ||vi||2,  vi  =  vx/ax 
2:  Ux  =  ATUx,  Vx  =  Avx 
3:  for  k  =  0, 1, . . .  until  convergence  do 

4:  Compute  Uk,  a  basis  for  1CS(AAT  ,usk+ 1),  Vk,  a  basis  for  ICS(AAT ,  Avsk+ 1), 

Vfc,  a  basis  for  ICs+x{AT  A,vsk+i),  and  Uk,  a  basis  for  ICS(AT  A,  ATusk+ 1).  Let 
yk  =  [Uk,Vk],  Zk  =  [Vk,Uk]. 

5:  g^)  =  3f  34  G{z)  =  z^zk 

6:  u'x  =  ex,  v'x  =  ex,  u[  =  es+2,  v[  =  es+1- 

7:  for  j  =  1 , ,s  do 

8:  u'j+1  =  Vj  -  dsk+jUj 

9:  Psk+j+1  =  ( ’“j  +  lGfc 

10:  Uj+ 1  =  Uj+1/  (3ak+j+x 

ll:  ^j+ 1  ( Vj  exskjrjUj)/f3skjrjjr  1 

12:  Vj+x  =  U'j+1  -  Psk+j+ XVj 

13:  ask+j+ 1  =  ( Vj^iG [.  ^Vj  + 

i4:  Vj+l  =  v'j+x/ask+j+ x 

15:  if  j  <  s  then 

16:  Vj  +  X  =  (Tk  ^ U'j+X  ~  Psk+j+lVj)/ask+j+l 

17:  end  if 

18:  end  for 

19:  [usk+2,  ■  ■  ■  >7tsfc+s  +  l]  34  [w2,  ■  ■  ■  ,  us+i] 

20:  [vsk+2,  ■  ■  .,Vsk+s+1]  =  Zk[v 2,  .  .  •  ,f'  +  1] 

21:  end  for 


4.3.  Alternating  matrix  powers.  Another  communication-avoiding  variant 
can  be  derived  which  builds  two  coupled  Krylov  bases,  where  basis  vectors  are  com¬ 
puted  by  alternating  between  multiplication  by  A  and  by  AT .  We  still  need  to  obtain 
4s  + 1  basis  vectors  in  order  to  take  s  steps  of  the  algorithm,  but  in  this  case  we  do  not 
need  to  construct  or  multiply  by  AT  A  and  AAT .  It  is  less  clear  how  to  choose  poly¬ 
nomial  basis  parameters  (entries  of  Tk)  in  this  case.  Numerical  and  performance  com¬ 
parisons  between  these  versions  and  the  communication-avoiding  versions  discussed 
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in  Section  4.2  remains  future  work.  Note,  as  before,  in  both  algorithms  derived  below 
we  show  recovery  of  all  iteration  vectors  after  each  inner  loop,  although  only  the  last 
vectors  are  needed  to  begin  the  next  outer  loop. 

4.3.1.  Reduction  to  upper  bidiagonal  form.  Another  version  can  be  derived 
by  using  coupled  recurrences  to  generate  bases  for  vsk+j+ 1  and  Psk+j+x  ■  Recall  that 

for  j  e  {0, 

Psk+j-\- 1  ^  ^4+ i  ( AA  ,Psfc+i)  KS(AA  ,  and 

^sk+j-\- 1  £  fcs(A  A,vsk- j-i)  -f  ICS(A  A,  A  Ps^+i)- 


Then  assume  that  we  have  a  2s  dimensional  basis  Z k  such  that 

and  a  2s  4-  1  dimensional  basis  34  such  that  psfc+j+i  =  34p}+i  for  j  €  {1, . . . ,  s},  and 

that  these  bases  satisfy  the  recurrences 

AZk  =  ykTk  and  ATyk  =  ZkTk, 

where  Zk  and  34  are  the  same  as  Zk  and  34,  resp.,  but  with  the  last  column  removed. 
The  matrices  Tk  and  Tk  are  tridiagonal  matrices  of  the  form  given  in  (4.1).  Given 
z\  =  vsk+i  and  y\  =  psk+ i,  the  columns  of  Zk  and  34  can  be  generated  by,  e.g., 
computing 


y2  =  {Azi  -  diyi)/7i, 

^2  =  (ATi/i  -  di2i)/7i,  and 
yt+  i  =  (Azg  -  atyt  -  bi-\yt-\)!li  for  t  e  {2, . . . ,  2s  4- 1}, 
ze+  i  =  ( ATye  -  agzg  -  j3e-izt-i)/jt  for  i  e  {2, ... ,  2s}, 

where  zg  and  vg  denote  the  ftli  columns  of  Zk  and  yk,  respectively.  Note  that  above, 
the  coefficients  ag,  fig,  and  jg  could  be  different  for  computation  of  yg+\  and  zg+ 1- 
Then 


—  A  T 

Vsk+j+1  —  A  Psk+j  Psk+j^sk+j 
zkVj+i  =  ATykp'0  -  Psk+jZkVj 

=  ZkTkpj  psk+jZkVj 


and 


Psk+j  +  l  —  Avsk-\-j-\-X  ^sk+j+lPsk+j 

34Pj+l  =  AZ_kVj^_ i  4s7,:+j  +  'l  ykP  j 

—  y k  t  \  kj 4_  i  d.s7,-+/+i  Tap,  • 

Therefore  in  the  inner  loop  we  can  update 

Vj+ 1  =  Tkp'j  -  Psk+jv'j  and 
Pj+i  =  TkVj+l  —  0sfc+j+iPj, 

and  recover  the  iteration  vectors  by 

Vsk+j+i  =  -AkV j+\  and  p  sk+j +\  =  43:  Pj  4-  j . 
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for  j  G  {1,  •  ••,«}• 

The  scalars  required  for  normalization  can  be  computed  by 

9sk+j+ 1  =  ||w»fc+j+i||2  =  {vj+i(Zk  Zk)Vj+1)  =  (jjj+1Gk  and 

Psk+j+1  =  \\psk+j+ih  =  {p'?+i(ylyk)p'j+i)1/2  =  (p'^G^p'+i)  7  , 
and  then 


v'j+ 1  =  v'j+l/9sk+j+ i  and 

2*7+1  =  Pj  +  lt  Psk+j+1- 

The  resulting  communication-avoiding  upper  bidiagonalization  algorithm  is  shown 
in  Algorithm  7. 


Algorithm  7  Communication-avoiding  upper  bidiagonalization  with  alternating  ma¬ 
trix  powers 

Require:  m-by-n  matrix  A  and  length-n  vector  b 
1:  01  =  ||At?;||2,  Vi  =  ATb/dl,  pi  =  Avx,  px  =  ||pi||2,  Pi  =Pi/pi 
2:  for  k  =  0, 1, . . .  until  convergence  do 

3:  Compute  Zk  and  34  such  that  AZ_k  =  34^4  and  ATyk  =  ZkTk . 

4:  g[z)  =  z£zk,  g f }  -  y£yk 

5:  v[  =  ei,  pi  =  ei 

6:  for  j  =  1, . . . ,  s  do 

7:  Vj+l  =  Tkpb  -  Psk+jVj 

8:  &sk+j+ 1  =  (jjj+1  Gk 

9:  vj  + 1  =  Vj+l/@sk+j+l 

19-  Pj+1  ^k^j+l  Psk+j+lPj 

11:  Psk+j  +  1  =  (p'^iG^)p'+1)  7 

12:  P'j  +  l  =  Pj+l/ Psk+j+1 

13:  end  for 

14:  [vsk+2,  ■  ■  ■  jUsfc+s+l]  =  Zk\l>2,  ■  ■  •  ,Us-|-i] 

15:  [Psfc+2;  .  .  .  ,Psfc+s+l]  =  34 [P2,  ■  •  •  >Ps  +  l] 

16:  end  for 


4.3.2.  Reduction  to  lower  bidiagonal  form.  Now  we  derive  a  version  of 
lower  bidiagonalization  using  coupled  recurrences  to  generate  bases  for  usk+j+i  and 
vsk+j+i ■  Recall  that  for  j  £  {0, . . . ,  s}, 

Vsk+j+i  G  KLS^AA  ,  Avsk+i)  T  KLS{^AA  ,  i)  and 
Vsk+j+i  G  )Cs+i(AT A,vsk+i)  +  K.S(AT A,  ATusk+ 1). 

Then  assume  that  we  have  a  2s  dimensional  basis  34  such  that  usk+j+ i  =  34*4+1 
and  a  2s  +  1  dimensional  basis  Zk  such  that  vsk+j+ i  =  Zkv'j+1  for  j  £  {1, ,  s},  and 
that  these  bases  satisfy  the  recurrences 

AZk  =  ykTk  and  ATyk  =  ZkTk, 
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where  Z_k  and  y_k  are  the  same  as  Zk  and  34  but  with  the  last  column  removed. 
Given  usk+ 1  and  vsk+i,  the  columns  of  34  and  Zk  can  be  generated  by  computing 

y2  =  ( Az i  -  diyi)/7i, 

22  =  (ATyi  -  di2i)/7i,  and 
ye+  i  =  (Azi  -  aeye  -  ^e-iye-i)/lt  for  £  G  {2,..  .,2s}, 
ze+i  =  ( ATye  -  atzt  -  Pe-xze-x)/^  for  l  e  {2, . . . ,  2s  +  1}, 

where  zt  and  ve  denote  the  ttli  columns  of  Zk  and  34,  respectively.  Note  that  above, 
as  in  the  upper  bidiagonalization  case,  the  coefficients  &e,  $(.,  and  7^  can  be  different 
for  computation  of  ye+i  and  2^+1. 

Then 

lU,sk-\-j+ 1  =  Avsk-\.j 
yku'j+i  =  AZbVj 

=  ykTkv'3 

and 

Vsk+j  +  l  =  A  'll S — j—y  — | —  '|  (5sk+j-\-lVsk+j 

^'k'^j-ir\ =  a  y k  Uj 1  zkVj 

=  ZkTkUj+1  /^sfc+j+1  ZkVj . 

Therefore  in  the  inner  loop  we  can  update 

u'j+ 1  =  ThVj  -  ash+jUj  and 
^j+l  =  A:  U  j  _|_ 1  —  Psk+j+lVji 

and  recover  the  iteration  vectors  by 

Usk+j+i  —  34 iij  -f  1  and  Vsk+j+'i  =  Zk  Vj _j_  | . 

for  j  G  {1, 

The  scalars  required  for  normalization  can  be  computed  by 

1^2 

Psk+j+1  =  \\usk+j+ih  =  {uf+i(ylyk)u'j+i)1/2  =  {uf+lG{pu'j+^j  and 
usk+j+ 1  =  \\vsk+j+1\\2  =  (v'jliiZk  Zk)v'j+1)1/2  =  (v'£.1G<f)v'j+1')  7  , 
and  then 

u'j+ 1  =  u'j+1//3sk+j+ 1  and 

vj  + 1  =  ^j  +  l/asfe+j  +  l- 

The  resulting  communication-avoiding  lower  bidiagonalization  algorithm  is  shown 
in  Algorithm  8. 


&sk+j'U'sk-\-j 

^sk+j^k^j 

^sk+j^k^j 
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Algorithm  8  Communication-avoiding  lower  bidiagonalization  with  alternating  rna- 

trix  powers 

Require:  m-by-n  matrix  A  and  length-n  starting  vector  b 

1 

01  =  ||&||2,  «1  =  b/P  1,  V!  =  ATU!, 

011  =  IKH2,  vi  =  vi/ai 

2 

for  k  =  0, 1, . . .  until  convergence 

do 

3 

Compute  Zk  and  34  such  that  AZfc  =  ykTk  and  ATy,  =  ZkTk. 

4 

G[z)  =  zlzk,  G f }  =  y?yk 

5 

v[  =  ei,  u[  =  ei 

6 

for  j  =  1 , . . . ,  s  do 

7 

u'j+ 1  =  TkVj  -  ask+ju'j 

8 

Psk+J+1  =  (uf+1G^]u'3+^ 

1/2 

9 

uj  + 1  =  ^j+l/ Psk+j  +  l 

10 

Vj  +  1  ~  Tkuj+ 1  —  Psk+j+lvj 

11 

OCsk+j  +  1  =  (vj+1  Gk  ^Vj+ 

1/2 

12 

vj+ 1  =  Vj+l/ask+j+l 

13 

end  for 

14 

l'U'sk+2  ?  •  •  •  j  Hsfc+s+l]  —  34  ['U 2  >  * 

■  •  >  <+i] 

15 

[Xsk+2i  ■  •  •  7  Ibfc+s-t-l]  Zk  [l'2 ,  ■ 

•  X+i] 

16 

end  for 

5.  Communication-avoiding  LSQR.  Paige  and  Saunders  [19]  showed  that 
the  quantities  generated  by  the  lower  bidiagonalization  procedure  in  Algorithm  2  can 
be  used  to  solve  the  least-squares  problem  min||6  —  We  briefly  review  the 

rationale  behind  the  LSQR  algorithm  given  by  Paige  and  Saunders.  For  some  vector 
yi,  define  the  quantities 


Xi  =  ViVi, 

Vi  =  b  —  Axi ,  and 

U+i  Pio-i  Biyi. 

Since  for  the  lower  bidiagonalization  procedure  we  have  J7j+i(/3iei)  =  b  and  AV \  = 
Ui+iBi,  it  follows  that  r,  =  U.i+iti+i,  and  since  Ui+ ±  is  orthonormal,  this  suggests 
choosing  y.i  such  that  ||t,+i||2  is  minimized,  which  gives  the  least-squares  problem 
min  \\f3\e\  -  ||2. 

This  problem  is  solved  by  updating  the  QR  factorization  of  Bj  in  each  iteration, 
given  by 


Qi[Bi  ,0i  ei] 


Ri 


where  Ri  is  the  upper  bidiagonal  matrix  produced  by  Algorithm  1.  (Coinciden¬ 
tally,  this  factorization  provides  a  link  between  the  two  bidiagonalization  proce¬ 
dures;  see  [19]).  Above,  Qi  is  the  product  of  a  series  of  plane  rotations,  i.e.,  Qi  = 
Qi, i+i  ■  •  ■  Q-2,3Qi,2-  We  then  have 

Xi  =  ViR-'fi  =  Dift, 
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where  the  columns  of  D*  can  be  found  successively  by  forward  substitution  on  the 
system  Rf  Dj  =  V? .  This  gives 

di  =  (1  /pi)(vi  -  Oidi-i)  and 

Xi  —  Xi—  1  T  4>idi, 


where  do  =  =  0. 

The  QR  factorization  is  determined  by  constructing  the  zth  plane  rotation  Qiti+i 
to  operate  on  rows  i  and  i  +  1  of  the  transformed  [Bi  /+1]  and  eliminate  fii+i-  This 
recurrence  relation  can  be  written 


Pi  0 

Pi  ^2+1  02 

Si  Ci 

A+i  ^2+1  o 

0  1  02+ 1_ 

where  pi  =  aq,  4> i  =  +  ,  and  Ci  and  Si  are  the  elements  of  Qzy+i-  Note  that  s 
without  a  subscript  still  denotes  the  iteration  blocking  factor.  In  the  algorithm, 
vectors  m,  =  p.jdj  are  computed  instead  of  di.  As  in  the  previous  section,  quantities 
with  bars  denote  intermediate  variables. 

Thus,  the  LSQR  algorithm  proceeds  as  follows.  One  begins  by  setting 

+  =/?i,  pi  =  ati,  wi=vi,  and  Xi  =  0n,i, 

and  proceeds  with  the  Lanczos  lower  bidiagonalization  process  (Algorithm  2).  In  each 
iteration,  after  fii+i ,  a*+i,  and  Vj+i  have  been  computed  via  the  bidiagonalization 
process,  one  updates 

Pi  =  (Pi  +Pf+ 1)1/2, 

Ci  =  pi/ Pi, 

$i  A+l/Pi, 

@i-\- 1  —  SiOLi-\.  i, 

Pi+ 1 
<i>i+ 1 

Xi+l 

Wi+1 

Ti+1 

The  resulting  algorithm  is  shown  in  Algorithm  9.  Any  of  the  communication-avoiding 
variants  of  the  lower  bidiagonalization  algorithm  given  in  Section  4  can  be  adapted 
to  give  a  communication-avoiding  version  of  LSQR.  In  Algorithm  11  we  show  a  CA- 
LSQR  method  based  on  the  implementation  in  Algorithm  6.  For  reference,  we  give  the 
intermediate  step  in  obtaining  this  new  method,  a  classical  LSQR  algorithm  which 
uses  auxiliary  quantities,  in  Algorithm  10.  In  Algorithm  12,  we  give  a  CA-LSQR 
method  using  the  alternating  matrix  powers  approach  of  the  bidiagonalization  in 
Algorithm  8. 

Note  that  in  Algorithm  11  and  12,  as  in  the  previous  section,  although  we  have 
shown  the  recovery  of  all  iteration  vectors  for  all  s  iterations  at  the  end  of  each  outer 
loop,  only  iteration  vectors  for  the  last  of  the  s  iterations  need  be  computed. 


—  Ci4*i  i 

. 

=  Xi  H - Wi , 

Pi 

Oi+l  , 

=  vi+i - Wi,  and 

Pi 

=  ||6  -  Axi+ 1||2. 
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Algorithm  9  LSQR 

Require:  m- by-n  matrix  A  and  length-n  starting  vector  b 

i:  ft  =  II&II2,  «i  =  b/ft>  ft  =  ai  =  Uplift  Vi  =  ft/ai 
2:  Si  =  ATU1,  ft  =  Av  1 

3:  ft  =  ft  Pi  =  ai  Wi  =  Vi  Xi  =  0„,1 
4:  for  i  =  1,2,...  until  convergence  do 

5:  ft+1  =  AUi  -  UiUi 

Pi+1  =  ('^i+l'^i+lj 
7:  ft+1  =  ft+l/ft+1 

8:  ft+1  =  ATUi+1  -  ft+lft 

9:  ai+1  =  (vf+1vi+1)1/~ 

10;  ft+1  ft+l/ft+1 

11:  Pi  =  (A? +ft2+i)1/2 

12:  ft  —  Pi/ Pi)  Si  ft  +  l/  Pi 

13:  ft+1  —  S^ft+i,  Pi+1  —  ftft+1 

14:  ft  —  ftft ,  ft+1  —  Sift 

15:  ft+l  =  ft  +  (<t>i/pi)  Wi 

16:  Wi+i  =  ft+i  -  (ft+l/pi)  Wi 

17:  end  for 


Algorithm  10  LSQR  with  auxiliary  quantities 
Require:  m-by-n  matrix  A  and  length-n  starting  vector  b 

l:  ft  =  II&II2,  «i  =  b/fti  ft  =  ^+1,  ai  =  ||ft||2,  ft  =  ft/ft 

2:  Ui  =  ATU\1  ft  =  Av  1 

3:  ft  =  ft,  Pi  =04,  Wi  =  ft,  ft  =  0n,i 

4:  for  i  =  1,2,...  until  convergence  do 

5:  ^i+1  QLjlLi 

o  ( -T  -  \  1/2 

6:  Mi+1  = 

7:  ft+1  =  ft+l/ft+1 

8:  ft+i  =  {AT Avi  -  a.iUi)/(3i+i 

9:  ft+1  =  ft+1  ^  ft+lft 

10:  ai+1  =  (ft+ft+l)17" 

11:  ft+ 1  =  ft+l/aj+i 

12:  ft+i  =  (AATui+ 1  -  ft+ift)/ai+i 

13:  Pi  =  (pf  +ft?+1)l/2 

14:  ft  Pi/ pi)  Si  ft+]  / Pi 

15:  ft+1  SiOi+i,  Pi+1  CiQi+i 

16:  (f)i  —  ftft,  ft+1  —  Sift 

17:  ft+1  =  Xi  +  (ft/pi)  Wi 

18:  W»+i  =  Vi+i  -  (ft+i /pi)  Wi 

19:  end  for 


6.  Future  work.  In  this  manuscript,  we  have  derived  three  communication¬ 
avoiding  variants  of  both  upper  and  lower  Lanczos  bidiagonalization  procedures,  and 
have  given  two  corresponding  versions  of  communication-avoiding  LSQR  solvers.  Fu¬ 
ture  work  involves  evaluation  of  both  the  convergence  and  stability  problems  of  the 
various  communication-avoiding  methods  presented  here  for  a  variety  of  different  prob- 
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Algorithm  11  CA-LSQR 


Require:  m- by-n  matrix  A  and  length-n  starting  vector  b 

i:  Pi  =  II&II2,  ui  =  b/ Pi,  vi  =  ATui,  0.1  =  ||vi||2,  vi  =  vi/ai 

2:  Ui  =  ATUi,  Vi  =  Av l 

3:  4>1  —  Pi ,  pi=Oi,  Wi=Vi,  Xi  =  0„,1 
4:  for  k  =  0, 1, . . .  until  convergence  do 

5:  Compute  Uk,  a  basis  for  K,s(AAr  ,usk+ 1),  Uk,  a  basis  for  fCs(AT A,  AT usk+ 1), 

14,  a  basis  for  ICs+i(AT  A,vsk+i),  and  Vk,  a  basis  for  ICS(AAT ,  Avsk+ i).  Let 


yk 

6: 

7: 


9: 

10: 


11: 

12: 

13: 

14: 


15: 

16: 

17: 

18: 

19: 

20: 

21: 

22: 

23: 


=  [Uk,Vk],  Zk  =  [Vk,Uk]. 

Cf”  =  Gf- 1  =  2J2* 

w'l  =  ei,  ui  =  ei,  u'i  =  es+2,  v[  =  es+1. 
w'l  =  [0i;2s+l,  1]T,  x'l  =  02S+2.1 
for  j  =  1 , . . . ,  s  do 
5  +  1  =  Vj  -  Osk+jUj 
a  ,  _  (f-.IT  V/2 

Psk+j  +  l  —  [Uj  +  i^k  Uj+1  J 

Uj+ 1  =  u'j+l/ Psk+j  +  1 

5+i  =  (TlZ)v'j  -  ask+ju'j)/Psk+j+ 1 
5+i  =  5+i  ~  Psk+j+Wj 

ask+j+i  =  ( 'v'7+iG^v'j^ )  7 

5+i =  5+i/a^+i+i 

5+i =  (-5  )5+i —  /3sfc+j+i5)/Q!«fc+j+i 

Psk+j  =  (Psfc+j  +  /^sfc+j+l)1^2 

csfc+j  =  Psk+j  / Ps  fe+j  i  &sk-\-j  —  Psk+j+l/  Psk+j 

Psk+j+1  Ssk+jOsk+j+ 1,  Ps/c+j  +  1  —  Csfc+jHsfc+j  +  1 

Psk+j  —  Csk+jPsk+j ,  Psk+j+l  Ssk+jPsk+j 

5+i  =  5  +  5 

5+1  =  [5+1  >  _  {@sk+j  + 1/ Psfc+j)  5 


24 

end  for 

25 

\'U,sk+ 2?  •  • 

■)  ^sfc  +  s+l] 

—  34[5>  •  ■  •  >  5+i] 

26 

2i  •  • 

5  ^sfc  +  S+l] 

=  zk\v’2,...,  5+J 

27 

[*^sfc+2?  •  • 

?  ^s/c+s+l] 

=  [Zk,wsk+ i\[x'2, . . 

1  *^s+l]  "f"  3?sfc+lll,s 

28 

sk-\-2)  •  • 

•  5  ^sfc  +  S  +  1 

=  [Zk,wsk+j][w 2, . 

•X+ 1] 

29 

end  for 

lems  (and  different  s  values),  as  well  as  a  performance  study  for  both  sequential  and 
parallel  versions.  It  is  not  clear  from  the  derivations  whether  one  method  will  always 
win  over  the  others  in  terms  of  best  speed  per  iteration  or  best  convergence  rate.  The 
correct  method  to  choose  will  depend  on  the  structure  and  conditioning  of  the  matrix, 
the  requirements  of  the  particular  application,  and  machine-specific  parameters  such 
as  cache  size  and  relative  latency/bandwidth  cost. 
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Algorithm  12  CA-LSQR  with  alternating  matrix  powers 
Require:  m-by-n  matrix  A  and  length-n  starting  vector  b 
i:  (h  =  \\b\\2,  «i  =  b//3 1,  vx  =  ATui,  ax  =  ||mi||2,  Vx  =  Vx/^x 
2-.<j>x=Px,  Pi  =  ax,  Wx=Vx,  Xx  =  0n,l 
3:  for  k  =  0, 1, . . .  until  convergence  do 

4:  Compute  Zk  and  34  such  that  AZk  =  ykTk  and  ATy_k  =  ZkTk. 

s:  g[2)  =  ztz,  G(p  =  yTy 

6:  v'x  =  ex,  u'x  =  ex 

7;  w'x  =  [Oi^s+l)  1]T,  x'x  =  02s+2,l 
8:  for  j  =  1 , ,s  do 

9:  u'j+ 1  =  TkVj  -  ask+jUj 

!0;  Psk+j+1  =  (u'^lxGk  ^j+ 1) 

11:  MJ  +  1  =  Mj_|_i//3sfe+j+l 

12:  Vj+1  =  4- 1  fisk+j+lVj 

13:  asfc+j+i  =  ^'-+1) 


uj+l  —  vj  +  l/ask+j+l 
Psk+j  =  (pik+j  +  Psk+j+ 1)1^2 

csfc+j  =  Psk+j  /  Ps  fc+j  i  Ssk-\-j  —  Psk+j+l/ Psk+j 

@sk+j+ 1  =  _  Psfc+j+l  4* k+j  11  s  fc  +  J + 1 

(psk+j  —  esk+j4*sk+j ,  (psk+j+1  —  &sk+j(psk+j 

xj+ 1  =  xj  d~  {(psk+j/ Psk+j )  Wj 
Wj+x  =  Wj+x,  0]T  -  (^sfe+j+i/Psfe+j) 


21 

end  for 

22 

[l^sfc+2  5  •  • 

■>  Hsfc+s+l] 

=  34[w,2,---X+l] 

23 

bsfc+2,  •  ■ 

5  ^sfc  +  S+l] 

=  Zk[v 2,  .  .  •  ,  Mg+i] 

24 

■>  ^sfc+s+l] 

=  [Zk,wsk+ i\[x'2, . . 

1  ^s+l]  %sk-\-l^-l,S 

25 

2  5  •  • 

•  5  ll^sfc  +  s  +  l 

=  [Zk,wsk+j][w'2, . 

■X+il 

26 

end  for 

Numbers  DE-SC0005136,  DE-SC0008699,  DE-SC0008700,  and  AC02-05CH11231;  by 
DARPA  Award  Number  HR0011-12-2-0016,  as  well  as  contributions  from  Intel,  Ora¬ 
cle,  and  MathWorks. 
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